Abstract. In the multiarmed bandit problem a gambler chooses an arm 
of a slot machine to pull considering a tradeoff between exploration and 
exploitation. We study the stochastic bandit problem where each arm 
has a reward distribution supported in a known bounded interval, e.g. 
[0, 1]. For this model, policies which take into account the empirical vari- 
ances (i.e. second moments) of the arms are known to perform effectively. 
In this paper, we generalize this idea and we propose a policy which ex- 
ploits the first d empirical moments for arbitrary d fixed in advance. 
The asymptotic upper bound of the regret of the policy approaches the 
theoretical bound by Burnetas and Katehakis as d increases. By choos- 
ing appropriate d, the proposed policy realizes a tradeoff between the 
computational complexity and the expected regret. 



1 Introduction 

The multiarmed bandit problem is one of the formulations of the tradeoff be- 
tween exploration and exploitation. This problem is based on an analogy with 
a gambler playing a slot machine with more than one arm. The gambler pulls 
arms sequentially so that the total reward is maximized. 

We consider a $K$-armed stochastic bandit problem originally considered in
[1]. There are $K$ arms and each arm $i = 1, \dots, K$ has a probability distribution
$F_i$ with the expected value $\mu_i$. The gambler chooses an arm to pull based on a
policy and receives a reward according to $F_i$ independently in each round. We
call an arm $i$ optimal if $\mu_i = \mu^*$ and suboptimal if $\mu_i < \mu^*$, where
$\mu^* = \max_i \mu_i$. Then, the goal of the gambler is to maximize the sum of the
rewards by pulling optimal arms as often as possible. Much research has been
conducted on the stochastic bandit problem [2] [3] [4] [5] [6] [7] as well as the
non-stochastic bandit [8] [9].

In this paper we consider the model $\mathcal{F}$, the family of distributions with
supports contained in the bounded interval [0,1]. The gambler knows that each
distribution $F_i$ is included in $\mathcal{F}$. For this model Upper Confidence Bound (UCB)
policies are popular for their simple form and fine performance [10, 11].
Recently Honda and Takemura [12] proposed the Deterministic Minimum Empirical
Divergence (DMED) policy, which satisfies for an arbitrary suboptimal arm $i$ that



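\[ E[T_i(n)] \le \left( \frac{1}{D_{\min}(F_i, \mu^*)} + o(1) \right) \log n , \tag{1} \]

where $T_i(n)$ denotes the number of times that arm $i$ has been pulled over the
first $n$ rounds and

\[ D_{\min}(F, \mu) = \min_{G \in \mathcal{F} : E_G[X] \ge \mu} D(F \| G) \]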

with Kullback-Leibler divergence $D(\cdot\|\cdot)$. DMED is asymptotically optimal since
the coefficient of $\log n$ on the right-hand side of (1) coincides with the
theoretical bound given in [13]. However, the complexity of the DMED policy is still
larger than that of, e.g., UCB policies, although the computation involved in DMED is
formulated as a univariate convex optimization. This is mainly because DMED
requires the empirical distributions of the arms themselves whereas other popular
policies can be computed from the moments of the empirical distributions of the
arms, such as means and variances.

Now, our question is how close we can bring the performance to the right-hand
side of (1) by a policy which only considers the first $d$ empirical moments
of the arms at each round. In this paper, we propose DMED-M policy, which is
a variant of DMED and is computable only from the empirical moments of the
arms. For an arbitrary suboptimal arm $i$, DMED-M satisfies

\[ E[T_i(n)] \le \left( \frac{1}{\inf_{F \in \mathcal{F} : E^{(d)}(F) = E^{(d)}(F_i)} D_{\min}(F, \mu^*)} + o(1) \right) \log n , \tag{2} \]

where $E^{(d)}(F) = (E_F[X], \dots, E_F[X^d])$ denotes the first $d$ moments of $F$, and
this upper bound approaches (1) as $d \to \infty$.

DMED-M is obtained by an analogy with DMED. Intuitively, DMED exploits
the fact that the maximum likelihood that the arm with empirical distribution
$\hat{F}_i$ is actually the best is roughly $\exp(-t D_{\min}(\hat{F}_i, \hat\mu^*))$ for number of samples $t$.
When ignoring properties of the distribution $\hat{F}_i$ except for its first $d$ moments,
we overestimate the maximum likelihood as

\[ \exp\left( -t \inf_{F \in \mathcal{F} : E^{(d)}(F) = E^{(d)}(\hat{F}_i)} D_{\min}(F, \hat\mu^*) \right) \]

instead of $\exp(-t D_{\min}(\hat{F}_i, \hat\mu^*))$, and the bound (2) appears correspondingly.

In DMED-M, it is necessary to compute
$\inf_{F \in \mathcal{F} : E^{(d)}(F) = (M_1, \dots, M_d)} D_{\min}(F, \mu)$
for each round. Classical results on Tchebysheff systems and moment spaces
reveal that the $F$ attaining the infimum is determined only by the value of the first
$d$ moments $(M_1, \dots, M_d)$ when the objective function, here $D_{\min}(F, \mu)$, is included
in a particular class. Therefore the infimum is obtained by computing first
the optimal solution $F$ and then the value of the function $D_{\min}(F, \mu)$. Both
are obtained by solving polynomial equations, and DMED-M can be computed
efficiently for small $d$.



This paper is organized as follows. In Sect. 2 we give definitions used throughout
this paper. We propose DMED-M policy in Sect. 3. In Sect. 4 we study the
minimization of $D_{\min}$ over distributions whose first $d$ moments are common for
a practical implementation of DMED-M. Proofs of results in Sects. 3 and 4 are
given in Sect. 5. In Sect. 6 we discuss an improvement of DMED-M in terms of
the worst case performance. We present some simulation results on DMED-M
in Sect. 7. We conclude the paper with some remarks in Sect. 8.



2 Preliminaries 

Let $\mathcal{F}$ be the family of probability distributions on [0, 1] and $F_i \in \mathcal{F}$ be the
distribution of the arm $i = 1, \dots, K$. $E_F[\cdot]$ denotes the expectation under $F \in \mathcal{F}$.
When we write e.g. $E_F[u(X)]$ for a function $u : \mathbb{R} \to \mathbb{R}$, $X$ denotes a random
variable with distribution $F$. A set of probability distributions for $K$ arms is
denoted by $\boldsymbol{F} = (F_1, \dots, F_K) \in \mathcal{F}^K = \prod_{i=1}^{K} \mathcal{F}$. The expected value of arm
$i$ is denoted by $\mu_i = E_{F_i}[X]$ and the optimal expected value is denoted by
$\mu^* = \max_i \mu_i$.

Let $T_i(n)$ be the number of times that arm $i$ has been pulled through the
first $n$ rounds. $\hat{F}_i(n)$ and $\hat\mu_i(n)$ denote the empirical distribution and the mean
of arm $i$ after the first $n$ rounds, respectively. $\hat\mu^*(n) = \max_j \hat\mu_j(n)$ denotes the
highest empirical mean after the first $n$ rounds. We call an arm $i$ a current best
if $\hat\mu_i(n) = \hat\mu^*(n)$.

Now we review results in [12]. Define an index for $F \in \mathcal{F}$ and $\mu \in [0, 1]$ by

\[ D_{\min}(F, \mu) = \min_{G \in \mathcal{F} : E_G[X] \ge \mu} D(F \| G) , \]

where the Kullback-Leibler divergence $D(F\|G)$ is given by

\[ D(F \| G) = \begin{cases} E_F\left[ \log \frac{dF}{dG} \right] & \text{if } \frac{dF}{dG} \text{ exists}, \\ +\infty & \text{otherwise.} \end{cases} \]

Under DMED policy proposed in [12], the expectation of $T_i(n)$ for any suboptimal
arm $i$ is bounded as

\[ E_{\boldsymbol{F}}[T_i(n)] \le \frac{1 + \epsilon}{D_{\min}(F_i, \mu^*)} \log n + O(1) , \tag{3} \]

where $\epsilon > 0$ is arbitrary. The coefficient of the logarithmic term $1/D_{\min}(F_i, \mu^*)$ is
the best possible [13], and the following property holds for the function $D_{\min}(F, \mu)$.



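Proposition 1 ([12, Theorems 5 and 8]). If $E_F[X] \ge \mu$ then $D_{\min}(F, \mu) = 0$.
If $E_F[X] < \mu = 1$ then $D_{\min}(F, \mu) = \infty$. If $E_F[X] < \mu < 1$,

\[ D_{\min}(F, \mu) = \max_{0 \le \nu \le \frac{1}{1-\mu}} E_F[\log(1 - (X - \mu)\nu)]
 = \begin{cases}
 E_F[\log(1 - X)] - \log(1 - \mu) & \text{if } E_F\left[ \frac{1}{1-X} \right] \le \frac{1}{1-\mu}, \\
 \max_{0 < \nu < \frac{1}{1-\mu}} E_F[\log(1 - (X - \mu)\nu)] & \text{otherwise,}
 \end{cases} \]

where we define $\log 0 = -\infty$.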
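As an illustration of Proposition 1, the following is a minimal Python sketch that evaluates $D_{\min}(F, \mu)$ for a distribution with finite support, such as an empirical distribution. The function name `d_min`, the bisection on the derivative in $\nu$, and the iteration count are choices made for this example and are not taken from [12]; the sketch only relies on the concavity of the objective in $\nu$.

```python
import math


def d_min(points, weights, mu, iters=100):
    """Sketch of D_min(F, mu) for a discrete F on [0, 1] (hypothetical helper)."""
    support = [(x, w) for x, w in zip(points, weights) if w > 0]
    mean = sum(w * x for x, w in support)
    if mean >= mu:
        return 0.0
    if mu >= 1.0:
        return math.inf
    # E_F[1/(1-X)]; it is infinite as soon as F puts mass on the point 1.
    inv = sum(w / (1.0 - x) if x < 1.0 else math.inf for x, w in support)
    if inv <= 1.0 / (1.0 - mu):
        # First case of Proposition 1: the maximum is at nu = 1/(1-mu).
        return sum(w * math.log(1.0 - x) for x, w in support) - math.log(1.0 - mu)

    def grad(nu):
        # Derivative in nu of E_F[log(1 - (X - mu) nu)]; decreasing in nu.
        return sum(-w * (x - mu) / (1.0 - (x - mu) * nu) for x, w in support)

    # Second case: the maximizer is interior, so bisect on the sign of grad.
    lo, hi = 0.0, 1.0 / (1.0 - mu)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if grad(mid) > 0:
            lo = mid
        else:
            hi = mid
    nu = 0.5 * (lo + hi)
    return sum(w * math.log(1.0 - (x - mu) * nu) for x, w in support)
```

For a Bernoulli distribution the result coincides with the binary Kullback-Leibler divergence, and for a point mass at $x < \mu$ it reduces to $\log((1-x)/(1-\mu))$.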



Let $E^{(d)}(F) = (E_F[X], \dots, E_F[X^d])$ denote the first $d$ moments of $F$. The
set of distributions with the first $d$ moments equal to $M = (M_1, \dots, M_d)$ is
defined as $\mathcal{F}(M) = \{F \in \mathcal{F} : E^{(d)}(F) = M\}$. We sometimes write $M^{(d)}$ instead
of $M$ to clarify the length of the vector.

Now define $D^{(d)}_{\min}(M, \mu)$ by

\[ D^{(d)}_{\min}(M, \mu) = \inf_{F \in \mathcal{F}(M)} D_{\min}(F, \mu) . \]

This function plays a central role throughout this paper.



3 DMED-M Policy 

In this section we introduce DMED-M policy. This policy determines an arm to
pull based on the empirical moments of the arms. DMED-M requires computation
of the function $D^{(d)}_{\min}$, and we analyze this function in the next section.

In the following algorithm, each arm is pulled at most once in one loop.
Through the loop, the list of arms pulled in the next loop is determined. $L_C$
denotes the list of arms to be pulled in the current loop. $L_N$ denotes the list
of arms to be pulled in the next loop. $L_R \subset L_C$ denotes the list of remaining
arms of $L_C$ which have not yet been pulled in the current loop. The criterion for
choosing an arm $i$ is the occurrence of the event $J_i(n)$ given by

\[ J_i(n) = \{ T_i(n) D^{(d)}_{\min}(E^{(d)}(\hat{F}_i(n)), \hat\mu^*(n)) \le \log n - \log T_i(n) \} , \tag{4} \]

where $E^{(d)}(\hat{F}_i(n))$ represents the first $d$ empirical moments of arm $i$.

[DMED-M Policy]
Parameter. Integer $d > 0$.

Initialization. $L_C, L_R := \{1, \dots, K\}$, $L_N := \emptyset$. Pull each arm once.
$n := K$.

Loop.

1. For $i \in L_C$ in the ascending order,

   1.1. $n := n + 1$ and pull arm $i$. $L_R := L_R \setminus \{i\}$.

   1.2. $L_N := L_N \cup \{j\}$ (without a duplicate) for all $j \notin L_R$ such that
        $J_j(n)$ occurs.

2. $L_C, L_R := L_N$ and $L_N := \emptyset$.

As shown above, $|L_C|$ arms are pulled in one loop. At every round, arm $i$ is
added to $L_N$ if $J_i(n)$ occurs unless $i \in L_R$, that is, unless arm $i$ is planned to be pulled
in the remaining rounds in the current loop. Note that if arm $i$ is a current best
for the $n$-th round then $J_i(n)$ holds since $D^{(d)}_{\min}(E^{(d)}(\hat{F}_i(n)), \hat\mu^*(n)) = 0$ for this
case. Then $L_C$ is never empty. Note that DMED in [12] is obtained by replacing
$D^{(d)}_{\min}(E^{(d)}(\hat{F}_i(n)), \hat\mu^*(n))$ in (4) by $D_{\min}(\hat{F}_i(n), \hat\mu^*(n))$. In view of Theorem 2 below,
DMED can be regarded as DMED-M with $d = \infty$.
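The loop structure above can be sketched in Python for the simplest case $d = 1$, where $D^{(1)}_{\min}(M, \mu)$ is the Kullback-Leibler divergence between Bernoulli distributions with means $M_1$ and $\mu$ (Theorem 4). This is a minimal sketch under stated assumptions, not the authors' implementation: the names `dmin1`, `dmed_m_d1` and `bernoulli_arms`, the callback interface `pull(i)`, and the early stop at a fixed horizon are all introduced here for illustration.

```python
import math
import random


def dmin1(m1, mu):
    """D^(1)_min(M, mu): binary KL divergence between means m1 and mu."""
    if m1 >= mu:
        return 0.0
    if mu >= 1.0:
        return math.inf
    value = (1.0 - m1) * math.log((1.0 - m1) / (1.0 - mu))
    if m1 > 0.0:
        value += m1 * math.log(m1 / mu)
    return value


def dmed_m_d1(pull, num_arms, horizon):
    """Run DMED-M with d = 1; pull(i) must return a reward in [0, 1]."""
    counts = [0] * num_arms
    sums = [0.0] * num_arms

    def play(i):
        sums[i] += pull(i)
        counts[i] += 1

    for i in range(num_arms):            # initialization: pull each arm once
        play(i)
    n = num_arms
    current = list(range(num_arms))      # the list L_C
    while n < horizon:
        remaining = set(current)         # the list L_R
        upcoming = set()                 # the list L_N
        for i in current:                # ascending order
            if n >= horizon:             # assumption: stop at a fixed horizon
                break
            n += 1
            play(i)
            remaining.discard(i)
            means = [s / c for s, c in zip(sums, counts)]
            best = max(means)
            for j in range(num_arms):
                if j in remaining:
                    continue
                # Event J_j(n) of (4), specialized to d = 1.
                lhs = counts[j] * dmin1(means[j], best)
                if lhs <= math.log(n) - math.log(counts[j]):
                    upcoming.add(j)
        current = sorted(upcoming)
    return counts


def bernoulli_arms(means, seed=0):
    """Hypothetical test environment: arm i pays 1 with probability means[i]."""
    rng = random.Random(seed)
    return lambda i: 1.0 if rng.random() < means[i] else 0.0
```

For example, `dmed_m_d1(bernoulli_arms([0.9, 0.5, 0.1]), 3, 3000)` returns the number of pulls of each arm after 3000 rounds. For larger $d$ only `dmin1` and the stored statistics would change.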



Theorem 1. Fix $\boldsymbol{F} \in \mathcal{F}^K$ for which there exists a unique optimal arm $j$. Under
DMED-M policy, for any suboptimal arm $i$ and $\epsilon > 0$ it holds that

\[ E_{\boldsymbol{F}}[T_i(n)] \le \frac{1 + \epsilon}{D^{(d)}_{\min}(E^{(d)}(F_i), \mu^*)} \log n + O(1) , \]

where $O(1)$ denotes a constant dependent on $\epsilon$ and $\boldsymbol{F}$ but independent of $n$.

This theorem can be proved in a similar way to Theorem 4 of [12] with the
fact that $D_{\min}(F, \mu) \ge D^{(d)}_{\min}(E^{(d)}(F), \mu)$ always holds. However, we omit the
proof because it is long and very similar to the proof of Theorem 4 of [12]. The
bound in Theorem 1 approaches that of DMED given by (3) as $d \to \infty$ from the
following theorem, which we show in Sect. 5.

Theorem 2. For arbitrary $F \in \mathcal{F}$ it holds that

\[ \lim_{d \to \infty} D^{(d)}_{\min}(E^{(d)}(F), \mu) = D_{\min}(F, \mu) . \]



4 Practical Representation of $D^{(d)}_{\min}$

For a computation and a theoretical evaluation of DMED-M, it is essential to
analyze the function $D^{(d)}_{\min}(M, \mu) = \inf_{F \in \mathcal{F} : E^{(d)}(F) = M} D_{\min}(F, \mu)$. In this section
we study an explicit representation of this function.

The following theorem is the main result of the paper. In this theorem, we
identify a pair $(\{x_i\}, \{f_i\})$ with a discrete distribution $F$ such that $F(\{x_i\}) = f_i$.

Theorem 3. If $\mathcal{F}(M) = \{F \in \mathcal{F} : E^{(d)}(F) = M\}$ is nonempty then there
exists a unique optimal solution $\bar{F} \in \mathcal{F}(M)$ such that

\[ D^{(d)}_{\min}(M, \mu) = \inf_{F \in \mathcal{F}(M)} D_{\min}(F, \mu) = D_{\min}(\bar{F}, \mu) . \tag{5} \]

Furthermore, $F \in \mathcal{F}$ is the unique optimal solution $\bar{F}$ if and only if
$((x_1, \dots, x_l), (f_1, \dots, f_l))$ for $l = \lceil d/2 \rceil + 1$ is a solution of

\[ \begin{cases}
 \sum_{i=1}^{l} f_i x_i^m = M_m \ (m = 0, \dots, d), \quad x_1 = 0, \ x_l = 1, & d \text{ is odd}, \\
 \sum_{i=1}^{l} f_i x_i^m = M_m \ (m = 0, \dots, d), \quad x_l = 1, & d \text{ is even},
 \end{cases} \tag{6} \]

where we define the zeroth moment as $M_0 = 1$.

Note that the above $\bar{F}$ only depends on the moment $M$. Then, the value of
$D^{(d)}_{\min}(M, \mu)$ is obtained by computing $\bar{F}$ first and then $D_{\min}(\bar{F}, \mu)$. Recall that
$D_{\min}(\bar{F}, \mu) = \max_{0 \le \nu \le (1-\mu)^{-1}} E_{\bar{F}}[\log(1 - (X - \mu)\nu)]$. Since $\bar{F}$ has finite
support $\{x_1, \dots, x_l\}$, the optimal solution $\nu^*$ attaining the maximum is one of the
boundary points $0$, $(1 - \mu)^{-1}$ or an interior point $\nu \in (0, (1 - \mu)^{-1})$ such that

\[ \frac{d}{d\nu} E_{\bar{F}}[\log(1 - (X - \mu)\nu)]
 = \frac{\sum_{i=1}^{l} f_i (\mu - x_i) \prod_{j \ne i} (1 - (x_j - \mu)\nu)}{\prod_{j=1}^{l} (1 - (x_j - \mu)\nu)} = 0 , \tag{7} \]

which is obtained by solving a polynomial equation in $\nu$ (the numerator of (7),
whose degree is at most $l - 1$). We give an
explicit form of $D^{(d)}_{\min}(M, \mu)$ for $d = 1, 2, 3$ in the following theorem.

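Theorem 4. If $M_1 < \mu < 1$ and $\mathcal{F}(M)$ is nonempty, then $D^{(d)}_{\min}(M, \mu)$ is
expressed for $d = 1, 2$ as

\[ D^{(1)}_{\min}(M, \mu) = (1 - M_1) \log \frac{1 - M_1}{1 - \mu} + M_1 \log \frac{M_1}{\mu} , \]

\[ D^{(2)}_{\min}(M, \mu) = \frac{(1 - M_1)^2}{1 - 2M_1 + M_2} \log\left( 1 - \left( \frac{M_1 - M_2}{1 - M_1} - \mu \right) \nu^{(2)} \right)
 + \frac{M_2 - M_1^2}{1 - 2M_1 + M_2} \log\left( 1 - (1 - \mu)\nu^{(2)} \right) , \]

where

\[ \nu^{(2)} = \frac{(1 - M_1)(M_1 - \mu)}{(1 - M_1)\mu^2 - (1 - M_2)\mu + M_1 - M_2} . \]

For $d = 3$ it is expressed as

\[ D^{(3)}_{\min}(M, \mu) = \begin{cases}
 D^{(1)}_{\min}(M_1, \mu) , & M_1 = M_2 = M_3, \\
 \sum_{i=1}^{3} f_i \log(1 - \tilde{x}_i \nu^{(3)}) & \text{otherwise,}
 \end{cases} \]

where

\[ (\tilde{x}_1, \tilde{x}_2, \tilde{x}_3) = \left( -\mu, \ \frac{M_2 - M_3}{M_1 - M_2} - \mu, \ 1 - \mu \right) , \]

\[ (f_2, f_3) = \left( \frac{(M_1 - M_2)^3}{(M_2 - M_3)(M_1 - 2M_2 + M_3)}, \ \frac{M_1 M_3 - M_2^2}{M_1 - 2M_2 + M_3} \right) , \quad f_1 = 1 - f_2 - f_3 , \tag{8} \]

\[ \nu^{(3)} = \begin{cases}
 \dfrac{-b + \sqrt{b^2 + 4ac}}{2a} , & a \ne 0, \\
 \dfrac{c}{b} , & a = 0,
 \end{cases} \]

for

\[ (a, b, c) = \left( \tilde{x}_1 \tilde{x}_2 \tilde{x}_3, \ (M_2 - 2\mu M_1 + \mu^2) + (\tilde{x}_1 + \tilde{x}_2 + \tilde{x}_3)(\mu - M_1), \ \mu - M_1 \right) . \]

This theorem is obtained by solving (7) with $F = \bar{F}^{(d)}$ given in Lemma 1 below.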

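Lemma 1. If $\mathcal{F}(M)$ is nonempty then the solution $\bar{F}^{(d)}$ of (6) is expressed for
$d = 1, 2, 3$ as

\[ \bar{F}^{(1)} = (1 - M_1)\delta(0) + M_1 \delta(1) , \]

\[ \bar{F}^{(2)} = \begin{cases}
 \delta(1) , & M_1 = M_2 = 1, \\
 \dfrac{(1 - M_1)^2}{1 - 2M_1 + M_2} \delta\left( \dfrac{M_1 - M_2}{1 - M_1} \right) + \dfrac{M_2 - M_1^2}{1 - 2M_1 + M_2} \delta(1) & \text{otherwise,}
 \end{cases} \]

\[ \bar{F}^{(3)} = \begin{cases}
 \bar{F}^{(1)} , & M_1 = M_2 = M_3, \\
 f_1 \delta(0) + f_2 \delta\left( \dfrac{M_2 - M_3}{M_1 - M_2} \right) + f_3 \delta(1) & \text{otherwise,}
 \end{cases} \]

where $\delta(x)$ denotes the delta measure at $x$ and $(f_1, f_2, f_3)$ is given by (8).
This lemma can be confirmed easily by substitution of $\bar{F}^{(d)}$ ($d = 1, 2, 3$) into (6).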
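To make the closed forms concrete, here is a small Python sketch that evaluates $D^{(d)}_{\min}(M, \mu)$ for $d = 1, 2, 3$ by first building the support points and weights of Lemma 1 and then plugging in the stationary point of (7). It is an illustration under assumptions: the function names are hypothetical, the moments are assumed non-degenerate ($0 < M_1 < \mu < 1$ and $M_1 > M_2 > M_3$ with $M_2 > M_1^2$), and for $d = 3$ the root of the quadratic is written in the algebraically equivalent form $2c / (b + \sqrt{b^2 + 4ac})$, chosen here only to avoid cancellation when $a$ is close to zero.

```python
import math


def dmin_d1(m1, mu):
    """d = 1: support {0, 1}; the value is a binary KL divergence."""
    return (1 - m1) * math.log((1 - m1) / (1 - mu)) + m1 * math.log(m1 / mu)


def dmin_d2(m1, m2, mu):
    """d = 2: support {x1, 1} with weights (f1, f2) as in Lemma 1."""
    den = 1 - 2 * m1 + m2
    x1 = (m1 - m2) / (1 - m1)
    f1 = (1 - m1) ** 2 / den
    f2 = (m2 - m1 ** 2) / den
    nu = (1 - m1) * (m1 - mu) / ((1 - m1) * mu ** 2 - (1 - m2) * mu + m1 - m2)
    return f1 * math.log(1 - (x1 - mu) * nu) + f2 * math.log(1 - (1 - mu) * nu)


def dmin_d3(m1, m2, m3, mu):
    """d = 3: support {0, x2, 1}; nu solves a*nu^2 + b*nu - c = 0."""
    x2 = (m2 - m3) / (m1 - m2)
    f2 = (m1 - m2) ** 3 / ((m2 - m3) * (m1 - 2 * m2 + m3))
    f3 = (m1 * m3 - m2 ** 2) / (m1 - 2 * m2 + m3)
    f1 = 1 - f2 - f3
    shifted = (-mu, x2 - mu, 1 - mu)          # support points minus mu
    a = shifted[0] * shifted[1] * shifted[2]
    b = (m2 - 2 * mu * m1 + mu ** 2) + sum(shifted) * (mu - m1)
    c = mu - m1
    nu = 2 * c / (b + math.sqrt(b * b + 4 * a * c))
    return sum(f * math.log(1 - x * nu) for f, x in zip((f1, f2, f3), shifted))
```

With the moments $(0.5, 0.3, 0.2)$ of Be(2,2) and $\mu = 0.6$, the three functions return approximately 0.0204, 0.0704 and 0.0844, increasing in $d$ as expected from the nesting of the moment constraints.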



5 Proofs 

In this section we show Theorems 2 and 3.

5.1 A Proof of Theorem 2

Theorem 2 is proved by a basic result on weak convergence and Lévy distance
(see, e.g., [2]). We say that a sequence of probability distributions $\{F_i\}$ converges
weakly to $F$ if $\lim_{i \to \infty} E_{F_i}[u(X)] = E_F[u(X)]$ for all bounded, continuous
functions $u(x)$. Define the Lévy distance $L(\cdot, \cdot)$ as

\[ L(F, G) = \inf\{ h > 0 : \forall x, \ F(x - h) - h \le G(x) \le F(x + h) + h \} , \]

where $F(\cdot)$ and $G(\cdot)$ denote cumulative distribution functions. Weak convergence
is equivalent to convergence in the Lévy distance, that is, $\{F_i\}$ converges
weakly to $F$ if and only if $\lim_{i \to \infty} L(F_i, F) = 0$.

Proposition 2 ([12, Theorem 7]). $D_{\min}(F, \mu)$ is continuous in $F \in \mathcal{F}$ with
respect to the Lévy distance.

Now we show Theorem 2 by Prop. 2.

Proof (of Theorem 2). From the continuity of $D_{\min}(F, \mu)$ in $F$, it suffices to show
for $M^{(d)} = E^{(d)}(F)$ that

\[ \lim_{d \to \infty} \sup_{G \in \mathcal{F}(M^{(d)})} L(G, F) = 0 . \tag{9} \]

Let $\{G_d \in \mathcal{F}(M^{(d)})\}_{d = 1, 2, \dots}$ be a sequence such that

\[ \limsup_{d \to \infty} \sup_{G \in \mathcal{F}(M^{(d)})} L(G, F) = \limsup_{d \to \infty} L(G_d, F) =: \bar{L} . \]

Since $\mathcal{F} \supset \mathcal{F}(M^{(d)})$ is compact with respect to the Lévy distance, there exist
$G \in \mathcal{F}$ and a convergent subsequence $\{G_{d_i}\}$ of $\{G_d\}$ such that

\[ \lim_{i \to \infty} L(G_{d_i}, F) = \bar{L} , \tag{10} \]
\[ \lim_{i \to \infty} L(G_{d_i}, G) = 0 , \tag{11} \]

where (11) means that $\{G_{d_i}\}$ converges weakly to $G$. From the definition of weak
convergence, for all natural numbers $m \in \mathbb{N}$ it holds that $\lim_{i \to \infty} E_{G_{d_i}}[X^m] =
E_G[X^m]$. On the other hand, $E_{G_{d_i}}[X^m] = E_F[X^m]$ for all $d_i \ge m$ from $G_{d_i} \in
\mathcal{F}(M^{(d_i)})$. Therefore we obtain for all $m \in \mathbb{N}$ that

\[ E_F[X^m] = \lim_{i \to \infty} E_{G_{d_i}}[X^m] = E_G[X^m] . \]

Note that a sequence of moments $\{E_F[X^m]\}$ has one-to-one correspondence
to a distribution $F$ for the case of bounded support. Therefore $G = F$ and we
obtain $\bar{L} = 0$ from (10). □



5.2 A Proof of Theorem 3

Theorem 3 is proved by theories of Tchebysheff systems and moment spaces (see
Appendix), and the following basic result on a saddle-point. For a function
$\varphi(x, y) : \mathcal{X} \times \mathcal{Y} \to [-\infty, +\infty]$, a point $(\bar{x}, \bar{y}) \in \mathcal{X} \times \mathcal{Y}$ is called a saddle-point if
$\varphi(\bar{x}, y) \le \varphi(\bar{x}, \bar{y}) \le \varphi(x, \bar{y})$ for all $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. A necessary and sufficient
condition for a saddle-point is

\[ \sup_{y} \varphi(\bar{x}, y) = \inf_{x} \sup_{y} \varphi(x, y) = \sup_{y} \inf_{x} \varphi(x, y) = \inf_{x} \varphi(x, \bar{y}) . \]

Proposition 3 (Minimax Theorem [15]). Let $\mathcal{X}$ and $\mathcal{Y}$ be compact subsets
of topological vector spaces. Let $\varphi(x, y) : \mathcal{X} \times \mathcal{Y} \to [-\infty, +\infty]$ be a
function such that $\varphi(\cdot, y)$ is convex and lower-semicontinuous for any fixed $y$ and
$\varphi(x, \cdot)$ is concave and upper-semicontinuous for any fixed $x$. Then there exists a
saddle-point $(\bar{x}, \bar{y}) \in \mathcal{X} \times \mathcal{Y}$.

In the proof of Theorem 3 we regard a probability measure $F$ as an element of
the family $\mathcal{V}$ of positive measures on [0, 1] to exploit the results in Appendix. By
letting $\tilde{M} := (1, M_1, \dots, M_d)$ for $M = (M_1, \dots, M_d)$, $D^{(d)}_{\min}(M, \mu)$ is rewritten
as

\[ D^{(d)}_{\min}(M, \mu) = \inf_{F \in \mathcal{V}(\tilde{M})} \max_{0 \le \nu \le \frac{1}{1-\mu}} E_F[\log(1 - (X - \mu)\nu)] , \tag{12} \]

where $\mathcal{V}(\tilde{M})$ is the set of positive measures with $0, 1, \dots, d$-th moments equal
to $\tilde{M}$, written in (16).

Proof (of Theorem 3). Let $\mathcal{M}_{d+1}$ be the moment space with respect to the
system $(1, x, \dots, x^d)$. It is easily checked that the solutions of (6) have one-to-one
correspondence to the representations of $\tilde{M} \in \mathcal{M}_{d+1}$ with index at most
$d/2$ or to the upper principal representation.

First consider the trivial case that $\tilde{M}$ is a boundary point of $\mathcal{M}_{d+1}$. For this
case, the proof is straightforward since $\mathcal{V}(\tilde{M})$ has a single element and its index
is at most $d/2$ from Prop. 5.

Now we consider the remaining case that $\tilde{M}$ is an interior point of $\mathcal{M}_{d+1}$.
For this case, the upper principal representation of $\tilde{M}$ is the unique solution of
(6) since existence of a representation with index at most $d/2$ implies that $\tilde{M}$ is
a boundary point of $\mathcal{M}_{d+1}$ from Prop. 5. In the following, we complete the proof
by showing that the upper principal representation of $\tilde{M}$ is the optimal solution
$\bar{F}$ in (5).

Consider applying the minimax theorem to (12). First, $\mathcal{V}(\tilde{M}) \subset \mathcal{V}$ is compact
with respect to the Lévy distance and $E_F[\log(1 - (X - \mu)\nu)]$ is linear in $F \in \mathcal{V}$
for any fixed $\nu$. Next, $E_F[\log(1 - (X - \mu)\nu)]$ is upper-semicontinuous and concave
in $\nu$ for any fixed $F$. Then we obtain from the minimax theorem that there
exists $\bar\nu$ satisfying

\[ D^{(d)}_{\min}(M, \mu) = \inf_{F \in \mathcal{V}(\tilde{M})} E_F[\log(1 - (X - \mu)\bar\nu)] . \]

Now we show $\bar\nu < (1 - \mu)^{-1}$ by contradiction. Assume $\bar\nu = (1 - \mu)^{-1}$. From
Prop. 6 with $x^* := 1$, there exists $F' \in \mathcal{V}(\tilde{M})$ such that $F'(\{1\}) > 0$. Therefore,
from $\log 0 = -\infty$,

\[ \inf_{F \in \mathcal{V}(\tilde{M})} E_F[\log(1 - (X - \mu)\bar\nu)] \le E_{F'}[\log(1 - (X - \mu)\bar\nu)] = -\infty . \]

This contradicts the nonnegativity of $D^{(d)}_{\min}$, and $\bar\nu < (1 - \mu)^{-1}$ is obtained.

From Lemma 3 with $p := 1 + \mu\bar\nu$ and $q := \bar\nu$, $(1, x, \dots, x^d, -\log(1 - (x - \mu)\bar\nu))$
is a T-system on [0, 1] since $p/q = 1/\bar\nu + \mu > 1$. Therefore, from Prop. 8 we obtain

\[ \inf_{F \in \mathcal{V}(\tilde{M})} E_F[\log(1 - (X - \mu)\bar\nu)]
 = - \sup_{F \in \mathcal{V}(\tilde{M})} E_F[-\log(1 - (X - \mu)\bar\nu)]
 = - E_{\bar{F}}[-\log(1 - (X - \mu)\bar\nu)]
 = E_{\bar{F}}[\log(1 - (X - \mu)\bar\nu)] = D_{\min}(\bar{F}, \mu) , \]

where $\bar{F}$ corresponds to the upper principal representation of $\tilde{M}$. □



6 Improvement of DMED-M Policy 

In DMED-M, $D_{\min}(F, \mu)$ is bounded from below by $D^{(d)}_{\min}(E^{(d)}(F), \mu)$. When the
gap between $D^{(d)}_{\min}$ and $D_{\min}$ is small, DMED-M behaves like the asymptotically
optimal policy, DMED. In this section, we propose DMED-MM policy, which
is obtained by a slight modification to DMED-M. We argue that DMED-MM
works successfully for the case where the gap between $D^{(d)}_{\min}$ and $D_{\min}$ is large.
Define a function $\tilde{D}^{(d)}_{\min}(F, \mu)$ by

\[ \tilde{D}^{(d)}_{\min}(F, \mu) = \begin{cases}
 D_{\min}(F, \mu) , & E_F\left[ \frac{1}{1-X} \right] \le \frac{1}{1-\mu}, \\
 D^{(d)}_{\min}(E^{(d)}(F), \mu) & \text{otherwise,}
 \end{cases} \]

where recall that $D_{\min}(F, \mu) = E_F[\log(1 - X)] - \log(1 - \mu)$ for the first case.
DMED-MM (DMED-M Mixed) policy is obtained by replacing $D^{(d)}_{\min}(E^{(d)}(\hat{F}_i(n)), \mu)$
in DMED-M by $\tilde{D}^{(d)}_{\min}(\hat{F}_i(n), \mu)$. Then the criterion for choosing an arm is the
same as DMED for the case $E_{\hat{F}_i(n)}[1/(1 - X)] \le 1/(1 - \mu)$ and the same as
DMED-M otherwise.

In the first place, $D^{(d)}_{\min}(E^{(d)}(\hat{F}_i(n)), \mu)$ in DMED-M is easy to compute since
the empirical moments $E^{(d)}(\hat{F}_i(n))$ can be obtained in constant time from the
sums $\sum_t X_{i,t}^m$, where $X_{i,t}$ denotes the $t$-th reward from arm $i$. On the other
hand, $D_{\min}(\hat{F}_i(n), \mu) = \max_\nu E_{\hat{F}_i(n)}[\log(1 - (X - \mu)\nu)]$ in DMED requires the
computation of the sum $\sum_t \log(1 - (X_{i,t} - \mu)\nu)$, where $\mu$ and $\nu$ generally take various
values. From this viewpoint, the computation of $\tilde{D}^{(d)}_{\min}(\hat{F}_i(n), \mu)$ is practical since it
is obtained from the sums $\sum_t X_{i,t}^m$, $\sum_t 1/(1 - X_{i,t})$ and $\sum_t \log(1 - X_{i,t})$.
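The bookkeeping described above can be sketched as follows: a per-arm record keeps only the running sums named in the text, and the mixed index is evaluated from them. This is a minimal sketch restricted to $d = 1$ for the moment-based branch; the class name `ArmStats`, the function `mixed_index_d1`, and the convention of storing infinities once a reward equal to 1 is observed are assumptions of this example.

```python
import math


class ArmStats:
    """Running sums for one arm: power sums, sum 1/(1-x), sum log(1-x)."""

    def __init__(self, d):
        self.t = 0
        self.pow_sums = [0.0] * d
        self.inv_sum = 0.0
        self.log_sum = 0.0

    def update(self, x):
        self.t += 1
        p = 1.0
        for m in range(len(self.pow_sums)):
            p *= x
            self.pow_sums[m] += p
        if x < 1.0:
            self.inv_sum += 1.0 / (1.0 - x)
            self.log_sum += math.log(1.0 - x)
        else:
            # A reward equal to 1 makes both empirical expectations infinite.
            self.inv_sum = math.inf
            self.log_sum = -math.inf

    def moments(self):
        return [s / self.t for s in self.pow_sums]


def mixed_index_d1(stats, mu):
    """Mixed index with d = 1, evaluated from the stored sums only."""
    m1 = stats.moments()[0]
    if m1 >= mu:
        return 0.0
    if mu >= 1.0:
        return math.inf
    if stats.inv_sum / stats.t <= 1.0 / (1.0 - mu):
        # DMED branch: E[log(1 - X)] - log(1 - mu).
        return stats.log_sum / stats.t - math.log(1.0 - mu)
    # DMED-M branch with d = 1: binary KL divergence between m1 and mu.
    value = (1.0 - m1) * math.log((1.0 - m1) / (1.0 - mu))
    if m1 > 0.0:
        value += m1 * math.log(m1 / mu)
    return value
```

Each update costs $O(d)$ time and the index is evaluated without revisiting past rewards, which is the practical point made in the paragraph above.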

Now consider the maximum gap between $D^{(d)}_{\min}(E^{(d)}(F), \mu)$ and $D_{\min}(F, \mu)$
among distributions $F$ with moment $M$.

Lemma 2. The supremum in

\[ \sup_{F \in \mathcal{F}(M)} \left\{ D_{\min}(F, \mu) - D^{(d)}_{\min}(E^{(d)}(F), \mu) \right\}
 = \sup_{F \in \mathcal{F}(M)} D_{\min}(F, \mu) - D^{(d)}_{\min}(M, \mu) \tag{13} \]

is attained by the unique solution of

\[ \begin{cases}
 \sum_{i=1}^{(d+1)/2} f_i x_i^m = M_m \ (m = 0, \dots, d), & d \text{ is odd}, \\
 \sum_{i=1}^{\lceil (d+1)/2 \rceil} f_i x_i^m = M_m \ (m = 0, \dots, d), \quad x_1 = 0, & d \text{ is even}.
 \end{cases} \tag{14} \]

The proof of the lemma is parallel to that of Theorem 3, which considers the
infimum of $D_{\min}(F, \mu)$. In Theorem 3 we saw that the upper principal representation
$\bar{F}$ of $M$ attains the infimum from Prop. 8. Similarly, we can show that
the lower principal representation $\underline{F}$ of $M$ attains the supremum in (13).

Note that we can obtain an explicit expression of (13) for small $d$ in the same
way as Theorem 4. However, such an expression is a complicated function of $M$,
as in Theorem 4, and it is not useful as an evaluation of the gap, since we cannot
know the value of the expression until we substitute the specific value of $M$.

Lemma 2 is useful when we consider the performance of DMED-MM. Compare
the solution $\underline{F}$ of (14) to the upper principal representation $\bar{F}$ in (6). For
odd $d$, $\underline{F}$ is supported by fewer points, which generally contain neither 0 nor
1. For even $d$, $\underline{F}$ is supported by the same number of points, which contain 0
instead of 1. In any case, we can say qualitatively that $\underline{F}$, the distribution for which
DMED-M behaves badly (i.e., the gap between $D_{\min}$ and $D^{(d)}_{\min}$ is large), has small
weight around 1.

Note that such a distribution often satisfies $E_F[1/(1 - X)] \le 1/(1 - \mu)$ since
$E_F[1/(1 - X)]$ is controlled mainly by the weight around 1. In fact, we can show
from Prop. 8 that $\min_{F \in \mathcal{F}(M)} E_F[1/(1 - X)]$ is attained by the lower principal
representation $\underline{F}$, which also attains the supremum in (13).

Now we summarize the above argument: (1) DMED-M behaves most differently
from DMED for $\underline{F}$ among distributions with the moment $M$. (2) Among
these distributions $F$, $\underline{F}$ is also the distribution minimizing $E_F[1/(1 - X)]$,
although the minimum value is not always smaller than $1/(1 - \mu)$. (3) If
$E_F[1/(1 - X)] \le 1/(1 - \mu)$, DMED-MM behaves in the same way as DMED (otherwise it
behaves in the same way as DMED-M). In this sense, the worst case gap between
DMED-MM and DMED is sometimes smaller than that between DMED-M and
DMED.



7 Experiments 

In this section we show some numerical results on DMED-MM and the function
$D^{(d)}_{\min}$.

First, we compare the performance of DMED-MM of degree $d = 1, 2, 3$ and DMED
in Fig. 1. Each plot is an average over 1000 different runs and we used 5 arms with
beta distributions Be($\alpha$, $\beta$). Note that beta distributions cover various forms of
distributions on [0, 1]. The parameters of the arms are ($\alpha$, $\beta$) = (9, 1), (0.7, 0.3),
(5, 5), (0.3, 0.7), (1, 9) with expectations $\mu_i$ = 0.9, 0.7, 0.5, 0.3, 0.1. The vertical
axis denotes the regret $\sum_{i : \mu_i < \mu^*} (\mu^* - \mu_i) T_i(n)$, which is the loss due to choosing
suboptimal arms. We see from the figure that the performance of DMED-MM
approaches that of DMED as the degree increases.

[Figure: regret versus number of plays; the horizontal axis is logarithmic, from 100 to 100000 plays.]

Fig. 1. Empirical regrets of DMED-MM of degree 1, 2, 3 and DMED for beta
distributions. Each plot is an average over 1000 different runs.

Table 1. Values of $D^{(d)}_{\min}(E^{(d)}(F), \mu)$ and $D_{\min}(F, \mu)$ for beta distributions.
The last column indicates whether $E_F[1/(1 - X)] \le 1/(1 - \mu)$ holds.

Distribution     $E_F[X]$   $\mu$   $D^{(1)}_{\min}$   $D^{(2)}_{\min}$   $D^{(3)}_{\min}$   $D_{\min}$   Condition
Be(2,2)          0.5        0.6     0.0204             0.0703             0.0843             0.0984       False
Be(0.5,0.5)      0.5        0.6     0.0204             0.0366             0.0400             0.0408       False
Be(1,3)          0.25       0.6     0.253              0.459              0.522              0.583        True
Be(0.25,0.75)    0.25       0.6     0.253              0.348              0.391              0.431        False
Be(2,2)/2        0.25       0.3     0.00617            0.0373             0.0490             0.0576       True
Be(0.5,0.5)/2    0.25       0.3     0.00617            0.0239             0.0337             0.0401       True

Next, we show values of $D^{(d)}_{\min}$ and whether $E_F[1/(1 - X)] \le 1/(1 - \mu)$ holds or not
for various distributions in Table 1. Recall that DMED-MM works the same as
DMED for the case $E_F[1/(1 - X)] \le 1/(1 - \mu)$. Distribution Be($\alpha$, $\beta$)/2 denotes
the distribution of $X/2$ for a random variable $X$ with distribution Be($\alpha$, $\beta$). It
corresponds to the case that the upper bound of the support of distributions is
unknown and assumed conservatively as 2 instead of 1. For this case, a reward
$X$ is passed as $X/2$ to a policy for distributions on [0, 1]. We see from the table
that $D^{(d)}_{\min}$ bounds $D_{\min}$ from below accurately when $E_F[1/(1 - X)] \le 1/(1 - \mu)$
is false, as discussed in the previous section. Overall, the gap between $D^{(1)}_{\min}$ and
$D_{\min}$ seems to be very large and it seems to be necessary to use at least the
second moment (i.e., variance) to achieve a smaller regret.
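As a sanity check on the inputs of Table 1, the raw moments of a beta distribution and the condition $E_F[1/(1 - X)] \le 1/(1 - \mu)$ can be computed in closed form. The sketch below uses the standard identities $E[X^k] = \prod_{r=0}^{k-1} (\alpha + r)/(\alpha + \beta + r)$ and, for $\beta > 1$, $E[1/(1 - X)] = (\alpha + \beta - 1)/(\beta - 1)$ (the expectation is infinite for $\beta \le 1$). The function names are hypothetical, and the rescaled rows Be($\alpha$, $\beta$)/2 are not covered because they need a numerical integral.

```python
import math


def beta_raw_moments(alpha, beta, d):
    """First d raw moments E[X], ..., E[X^d] of Be(alpha, beta)."""
    moments, m = [], 1.0
    for r in range(d):
        m *= (alpha + r) / (alpha + beta + r)
        moments.append(m)
    return moments


def beta_inverse_moment(alpha, beta):
    """E[1/(1-X)] for X ~ Be(alpha, beta); infinite when beta <= 1."""
    return (alpha + beta - 1.0) / (beta - 1.0) if beta > 1.0 else math.inf


def dmed_branch_applies(alpha, beta, mu):
    """True when E[1/(1-X)] <= 1/(1-mu), i.e. DMED-MM acts like DMED."""
    return beta_inverse_moment(alpha, beta) <= 1.0 / (1.0 - mu)
```

For instance, Be(2,2) has moments (0.5, 0.3, 0.2) and $E[1/(1 - X)] = 3 > 2.5$, so the condition is false for $\mu = 0.6$, whereas Be(1,3) gives $1.5 \le 2.5$.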

8 Conclusion 

In this paper we proposed DMED-M policy which is computed by the first d 
empirical moments of the arms. The regret bound of DMED-M approaches that 
of DMED, which is asymptotically optimal, as d increases. The computation 
involved in DMED-M is represented in an explicit form for small d. We also pro- 
posed DMED-MM policy, which sometimes improves the worst case performance 
of DMED-M. 

An open problem is whether the asymptotic bound of DMED-M is the best 
for all policies which only consider the empirical moments. We may be able to 
prove the optimality of DMED-M in this sense under some regularity conditions. 

A Tchebycheff Systems and Moment Spaces 

In this appendix we summarize results on Tchebycheff systems and moment
spaces needed for the proof of Theorem 3. All functions and measures are defined
on $[a, b]$ ($a < b$) in this appendix whereas they are on [0, 1] elsewhere. For any
set of points $\{x_1, \dots, x_l\}$, we always assume $a \le x_1 < x_2 < \dots < x_l \le b$.

Definition 1. Let $u_0(x), \dots, u_d(x)$ denote continuous real-valued functions on
$[a, b]$. These functions are called a Tchebycheff system (or T-system) if the
determinants

\[ \det \begin{pmatrix}
 u_0(x_0) & u_1(x_0) & \cdots & u_d(x_0) \\
 u_0(x_1) & u_1(x_1) & \cdots & u_d(x_1) \\
 \vdots & \vdots & & \vdots \\
 u_0(x_d) & u_1(x_d) & \cdots & u_d(x_d)
 \end{pmatrix} \tag{15} \]

are positive for all $\{x_0, \dots, x_d\}$.

A typical T-system is $u_i(x) = x^i$ ($i = 0, 1, \dots, d$), where (15) is represented
as the Vandermonde determinant

\[ \det \begin{pmatrix}
 1 & x_0 & \cdots & x_0^d \\
 1 & x_1 & \cdots & x_1^d \\
 \vdots & \vdots & & \vdots \\
 1 & x_d & \cdots & x_d^d
 \end{pmatrix} = \prod_{0 \le i < j \le d} (x_j - x_i) > 0 . \]

Let $Z(u)$ of a function $u(x)$ denote the number of distinct points $x \in [a, b]$
such that $u(x) = 0$. Then T-systems are characterized by the following proposition.

Proposition 4 ([16, Chap. I, Theorem 4.1]). If a system $\{u_i\}_{i=0}^{d}$ of continuous
functions on $[a, b]$ satisfies $Z(u) \le d$ for all nontrivial linear combinations

\[ u(x) = \sum_{i=0}^{d} a_i u_i(x) , \]

then $(u_0, u_1, \dots, u_d)$ or $(u_0, u_1, \dots, -u_d)$ is a T-system.

Lemma 3. For any $p$ and $q > 0$ satisfying $b < p/q$, $(1, x, \dots, x^d, -\log(p - qx))$
is a T-system on $[a, b]$.



Proof. Let $b' \in (b, p/q)$ be sufficiently close to $p/q$ and consider the function

\[ u(x) = \sum_{m=0}^{d} a_m x^m - a_{d+1} \log(p - qx) \]

on $x \in [a, b']$. Since the derivative of $u(x)$ is written as

\[ \frac{du(x)}{dx} = \frac{(p - qx) \sum_{m=1}^{d} m a_m x^{m-1} + a_{d+1} q}{p - qx} , \]

$u(x)$ has at most $d$ extreme points in $[a, b']$. Therefore $Z(u) \le d + 1$ and
$(1, x, \dots, x^d, \log(p - qx))$ or $(1, x, \dots, x^d, -\log(p - qx))$ is a T-system on $[a, b']$
from Prop. 4.

The determinant (15) for the system $(1, x, \dots, x^d, \log(p - qx))$ is written as

\[ \det \begin{pmatrix}
 1 & x_0 & \cdots & x_0^d & \log(p - q x_0) \\
 1 & x_1 & \cdots & x_1^d & \log(p - q x_1) \\
 \vdots & \vdots & & \vdots & \vdots \\
 1 & x_{d+1} & \cdots & x_{d+1}^d & \log(p - q x_{d+1})
 \end{pmatrix}
 = \sum_{m=0}^{d+1} (-1)^{d+m+1} \left( \prod_{0 \le i < j \le d+1 : \, i, j \ne m} (x_j - x_i) \right) \log(p - q x_m) . \]

For the case that $x_{d+1} = b'$ with $b' \uparrow p/q$, $\log(p - q x_{d+1})$ goes to $-\infty$ and the
sign of the determinant is controlled by the term involving $\log(p - q x_{d+1})$, which
is written as

\[ (-1)^{2d+2} \left( \prod_{0 \le i < j \le d} (x_j - x_i) \right) \log(p - q x_{d+1}) < 0 . \]

Then, $(1, x, \dots, x^d, \log(p - qx))$ cannot be a T-system on $[a, b']$ for $b'$ sufficiently
close to $p/q$ and therefore $(1, x, \dots, x^d, -\log(p - qx))$ has to be a T-system on
$[a, b']$. From the definition of T-system, it is also a T-system on $[a, b] \subset [a, b']$. □
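Definition 1 and Lemma 3 can be checked numerically on a finite grid. The sketch below evaluates the determinant (15) for every increasing tuple of grid points and tests its sign; it is only a spot check, not a proof, and the helper names `det`, `system_det`, `positive_on_grid` and `log_system`, as well as the choice $p = 1.5$, $q = 1$ in the usage note, are assumptions of this example.

```python
import math
from itertools import combinations


def det(mat):
    """Determinant by Laplace expansion along the first row (small matrices)."""
    n = len(mat)
    if n == 1:
        return mat[0][0]
    total = 0.0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in mat[1:]]
        total += (-1) ** j * mat[0][j] * det(minor)
    return total


def system_det(funcs, xs):
    """Determinant (15) of the system funcs at the increasing points xs."""
    return det([[u(x) for u in funcs] for x in xs])


def positive_on_grid(funcs, grid):
    """Check positivity of (15) for all increasing tuples from a sorted grid."""
    k = len(funcs)
    return all(system_det(funcs, xs) > 0 for xs in combinations(grid, k))


def log_system(d, p, q, sign=-1.0):
    """The system (1, x, ..., x^d, sign * log(p - q x)) of Lemma 3."""
    powers = [lambda x, m=m: x ** m for m in range(d + 1)]
    return powers + [lambda x: sign * math.log(p - q * x)]
```

On the grid 0, 0.25, 0.5, 0.75, 1 with $p = 1.5$ and $q = 1$, `positive_on_grid(log_system(2, 1.5, 1.0), grid)` holds, while flipping the sign of the logarithm makes every determinant negative, in line with the argument of the proof.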

Let $\mathcal{V}$ be the family of positive measures on $[a, b]$ and define a subset $\mathcal{V}(M)$
of $\mathcal{V}$ for a vector $M = (M_0, M_1, \dots, M_d)$ as

\[ \mathcal{V}(M) = \left\{ \sigma \in \mathcal{V} : \forall m \in \{0, 1, \dots, d\}, \ \int_a^b x^m \, d\sigma(x) = M_m \right\} . \tag{16} \]

The notion of moment spaces is essential to examine properties of T-systems.

Definition 2. The moment space $\mathcal{M}_{d+1}$ with respect to the T-system $\{u_i\}$ is
given by

\[ \mathcal{M}_{d+1} = \left\{ \left( \int_a^b u_0(x) \, d\sigma(x), \int_a^b u_1(x) \, d\sigma(x), \dots, \int_a^b u_d(x) \, d\sigma(x) \right) \in \mathbb{R}^{d+1} : \sigma \in \mathcal{V} \right\} . \]

Consider the case that $M \in \mathcal{M}_{d+1}$ satisfies

\[ M_m = \sum_{i=1}^{l} f_i u_m(x_i) \quad (m = 0, \dots, d) \tag{17} \]

with $x_1, \dots, x_l \in [a, b]$ and $f_1, \dots, f_l > 0$ for any finite $l$. We call such an
expression a representation of $M$. A representation of $M$ corresponds uniquely to
the measure

\[ \sigma = \sum_{i=1}^{l} f_i \delta(x_i) \in \mathcal{V} \]

for the delta measure $\delta(x)$ at point $x$. We sometimes identify the measure $\sigma$ with
the representation of $M$. The measure $\sigma$ is a probability measure if $\sum_i f_i = 1$.

The index of the representation (17) is defined as the number of the points
$(x_1, \dots, x_l)$ under the special convention that the points $a$, $b$ are counted as one
half. A representation is called principal if its index is $(d + 1)/2$. Furthermore,
the representation is upper if $(x_1, \dots, x_l)$ contains $b$ and lower otherwise.

For the proof of Theorem 3 it is necessary to study the nature of the set
$\mathcal{V}(M)$. It differs according to whether $M$ is a boundary point of $\mathcal{M}_{d+1}$ or an
interior point of $\mathcal{M}_{d+1}$.

Proposition 5 ([16, Chap. II, Theorem 2.1]). Every boundary point $M$ has
a unique representation. Moreover, $M \in \mathcal{M}_{d+1}$ is a boundary point of $\mathcal{M}_{d+1}$ if
and only if there exists a representation of $M$ with index at most $d/2$.

Proposition 6 ([16, Chap. II, Theorem 3.1]). If $M$ is an interior point of
$\mathcal{M}_{d+1}$ then, for arbitrary $x^* \in [a, b]$, there exists a representation of $M$ such
that $(x_1, \dots, x_l)$ contains $x^*$.

Proposition 7 ([16, Chap. II, Corollary 3.1]). If $M$ is an interior point of
$\mathcal{M}_{d+1}$ then there exist precisely one upper and one lower principal representation
of $M$.

We use Prop. 7 implicitly in Prop. 8 below and in the proof of Theorem 3. Prop. 8
is the main result of the appendix.

Proposition 8 ([16, Chap. III, Theorem 1.1]). Assume $(u_0, u_1, \dots, u_d)$ and
$(u_0, u_1, \dots, u_d, h)$ are T-systems. Then

\[ \max_{\sigma \in \mathcal{V}(M)} \int_a^b h(x) \, d\sigma(x) \]

is attained uniquely by $\bar\sigma$, the upper principal representation of $M$. Similarly,

\[ \min_{\sigma \in \mathcal{V}(M)} \int_a^b h(x) \, d\sigma(x) \]

is attained uniquely by $\underline\sigma$, the lower principal representation of $M$.



